1 Tuesday, July 23 2019

1.1 Gentle Introduction to Kaggle Competitions

Example: The following code will take you through your first Kaggle competition submission: House Prices: Advanced Regression Techniques.

## [1] 180921.2
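The chunk itself isn't reproduced here, but the printed value above matches the mean sale price in the training set, so a minimal "predict the mean for every house" baseline submission might look like this (the output file name is my choice):

```r
library(tidyverse)

# Load the house prices data (same URLs used later in these notes):
train <- read_csv("https://rudeboybert.github.io/SDS293/static/train.csv")
test <- read_csv("https://rudeboybert.github.io/SDS293/static/test.csv")

# Simplest possible model: predict the mean sale price for every house
mean_price <- mean(train$SalePrice)
mean_price

# Kaggle expects a two-column CSV: Id and predicted SalePrice
submission <- test %>%
  mutate(SalePrice = mean_price) %>%
  select(Id, SalePrice)
write_csv(submission, "submission_mean.csv")
```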

Exercise: You will now make a submission to Kaggle using a linear regression model. Do the following:

  1. Perform an exploratory data analysis of your chosen predictor variable and its relationship with SalePrice.
  2. Make a Kaggle submission.

Hint: Here is code that will allow you to fit a linear regression model to a training set and then apply it to the test set to get predictions.
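A sketch of what such code might look like, using GrLivArea (above-ground living area) as the lone predictor (a choice of mine; these notes use this variable later):

```r
library(tidyverse)

train <- read_csv("https://rudeboybert.github.io/SDS293/static/train.csv")
test <- read_csv("https://rudeboybert.github.io/SDS293/static/test.csv")

# Fit a simple linear regression to the training set:
model_lm <- lm(SalePrice ~ GrLivArea, data = train)

# Apply the fitted model to the test set to get predictions:
y_hat <- predict(model_lm, newdata = test)

# Package the predictions into a Kaggle submission file:
submission <- tibble(Id = test$Id, SalePrice = y_hat)
write_csv(submission, "submission_lm.csv")
```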

Solutions:

1.2 Iris dataset

The iris dataset is a canonical dataset in statistics and machine learning (Wikipedia). It was introduced by Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. For 150 iris flowers it has 5 variables.

The categorical outcome variable \(y\) is one of three species of iris flower:

Setosa, Versicolor, and Virginica (photos of the three species not shown).

Four numerical predictor variables \(\vec{x}\): Sepal length, sepal width, petal length, and petal width, all in cm.



Exercise: We’ll now visually build a model to predict Species using two predictor variables: Sepal.Length and Sepal.Width.

  1. Preprocess the iris data and perform an EDA.
  2. Based on the final visualization, try to create a model for predicting Species by partitioning the Cartesian plane with lines, i.e. cutting it up into pieces.
  3. If you have time, repeat this exercise using Petal.Length and Petal.Width.
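As a starting point for step 1, a sketch of the kind of EDA visualization intended (the iris data frame ships with base R):

```r
library(tidyverse)

# Scatterplot of the two predictors, colored by the outcome variable Species:
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Sepal measurements by species")
p
```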

1.3 Classification and Regression Trees

Example: The following code creates a visualization of a fitted CART model.

How does this CART model translate to our EDA from above?

Example: The following code shows you how to fit a CART model to training data and apply the fitted model to test data to obtain predictions.
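That code isn't reproduced here; a minimal sketch of the same fit-then-predict workflow using rpart on a random split of iris (the 70/30 split is my choice):

```r
library(rpart)

set.seed(76)  # same seed these notes use elsewhere

# Artificially split iris into a training and a test set:
idx <- sample(nrow(iris), size = 0.7 * nrow(iris))
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

# Fit a CART model to the training data:
model_CART <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris_train)

# Apply the fitted model to the test data to obtain predicted species:
y_hat <- predict(model_CART, newdata = iris_test, type = "class")
head(y_hat)
```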

Exercise: Make a submission to the House Prices: Advanced Regression Techniques competition by fitting a CART model. Note: set type = "vector" in the predict() call because we are now dealing with a numerical outcome variable.


2 Wednesday, July 24 2019

2.1 RMSLE

Root Mean Squared Error (RMSE) vs Root Mean Squared Logarithmic Error (RMSLE):
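Written out for \(n\) houses with observed prices \(y_i\) and predicted prices \(\widehat{y}_i\) (the \(+1\) inside the logs matches the RMSLE computation coded up later in these notes):

\[ \mbox{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(y_i - \widehat{y}_i\right)^2} \qquad \mbox{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(\log(y_i + 1) - \log(\widehat{y}_i + 1)\right)^2} \]

Because of the logs, RMSLE measures relative rather than absolute error: a $10,000 miss on a cheap house is penalized more than the same miss on an expensive one.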

For more on log and log10 transformations, see:

2.2 CART Shiny App

Go over the CART.Rmd Shiny App file in this RStudio Project folder. It delves deeper into the role of the “complexity parameter” cp in CART.

2.3 Demonstration of overfitting

Example: Recall the model I showed at the end of Tuesday: a CART model with complexity cp = 0 (i.e. maximal complexity) which got a Kaggle score of 0.21878. We’ll now show that cp = 0 leads to CART models that are wayyyyy overfit. We’ll do this by fitting a CART model to train and also evaluating the CART model’s performance via the RMSLE on train. In other words, you’ll fit and evaluate the model using the same data.

library(tidyverse)
library(rpart)
# For ggpairs() function below:
library(GGally)

# Reload house prices data
train <- read_csv("https://rudeboybert.github.io/SDS293/static/train.csv")
test <- read_csv("https://rudeboybert.github.io/SDS293/static/test.csv")

# First: EDA visualization of all pairwise relationships between numerical
# variables:
train %>% 
  select(SalePrice, GrLivArea, HalfBath, YearBuilt) %>% 
  ggpairs()

# Fit CART model to training. Note the cp = 0
model_CART_3 <- rpart(SalePrice ~ GrLivArea + HalfBath + YearBuilt, 
                      data = train, 
                      control = rpart.control(cp = 0))

# Visualize fitted tree! Lots of complexity!!
plot(model_CART_3, margin = 0.25)
text(model_CART_3, use.n = TRUE)
title("Very complex CART model for sale price")
box()

# Apply the model to get predictions on the training data, not the test data.
# In other words, we are fitting our model AND evaluating the performance of the
# model on the same data.
y_hat_3 <- predict(model_CART_3, type="vector", newdata = train)

# Compute the RMSLE using dplyr data wrangling. Write your resulting RMSLE down
# on a piece of paper. Note while we explicitly compute the RMSLE here, in the 
# future you'll use built-in functions.
train %>% 
  mutate(
    # Add variable of predicted sale prices:
    SalePrice_hat = y_hat_3,
    # Add variable of logarithmic error for each house (i.e. each row):
    le = log(SalePrice + 1) - log(SalePrice_hat + 1),
    # Add variable of squared logarithmic error for each house (i.e. each row):
    sle = le^2
    ) %>% 
  # Summarize the rows by computing the mean squared logarithmic error:
  summarize(msle = mean(sle)) %>% 
  # Add variable of root mean squared logarithmic error:
  mutate(rmsle = sqrt(msle))

Exercise: Instead of fitting and evaluating the model using the same train data as above, you’ll now fit and evaluate the model using different data. You’ll do this by artificially creating two separate data sets: randomly split train into train_validation and test_validation. This is the validation set approach; think of it as baby’s first building blocks towards cross-validation.
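A sketch of one way to do this (the 50/50 split proportion and the helper function rmsle() are my choices; rmsle() is not a built-in):

```r
library(tidyverse)
library(rpart)

train <- read_csv("https://rudeboybert.github.io/SDS293/static/train.csv")

set.seed(76)

# Randomly split train into two halves:
train_validation <- train %>% sample_frac(0.5)
test_validation  <- train %>% anti_join(train_validation, by = "Id")

# Small helper to compute the RMSLE:
rmsle <- function(y, y_hat) {
  sqrt(mean((log(y + 1) - log(y_hat + 1))^2))
}

# Fit the maximally complex CART model on one half only:
model_CART <- rpart(SalePrice ~ GrLivArea + HalfBath + YearBuilt,
                    data = train_validation,
                    control = rpart.control(cp = 0))

# Evaluate on the half the model has never seen:
y_hat <- predict(model_CART, type = "vector", newdata = test_validation)
rmsle(test_validation$SalePrice, y_hat)
```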

Questions:

  1. Compare your two RMSLE scores. Which one is bigger? The RMSLE when you fit/evaluate the model on the same data? Or the RMSLE when you fit/evaluate the model on different data?
  2. Which of these two RMSLE scores is closer to the actual RMSLE score of 0.21878 returned by Kaggle?

Bonus: Repeat the above comparison for a value of the complexity parameter cp other than 0. In other words, a value of cp that will yield a less complex tree, as quantified by the number of leaves it has.

2.4 Cross-validation

In this exercise, we’ll code up a 5-fold cross-validation algorithm from scratch. Note that while we are doing this from scratch here, once you get the idea behind cross-validation, in the future you’ll use built-in functions.

Specifically, we’ll get an estimate of the RMSLE on new independent data for a CART model with complexity parameter cp = 0 i.e. maximal complexity.

We start by randomly assigning each row in train to a fold (there are many other ways we could do this too):

fold     n
1      292
2      292
3      292
4      292
5      292

We also initialize a vector to store one RMSLE per fold:

## [1] 0 0 0 0 0
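The assignment step itself might look like this; a sketch on a stand-in tibble with 1460 rows (the number of rows in train), so it runs without downloading anything:

```r
library(tidyverse)

set.seed(76)

n <- 1460  # number of rows in train
folds <- tibble(id = 1:n) %>%
  # Repeat the fold labels 1 through 5 to length n, then shuffle:
  mutate(fold = sample(rep(1:5, length.out = n)))

# Each fold gets 1460 / 5 = 292 rows, matching the table above:
folds %>% count(fold)
```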

Let’s now perform cross-validation on the first fold only:

## [1] 0.2168692 0.0000000 0.0000000 0.0000000 0.0000000

Exercise: Here is an example of a for loop in R:
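A minimal version of that pattern: pre-allocate a results vector, then fill one entry per iteration (the same pattern you'll use to store one RMSLE per fold):

```r
# Pre-allocate a results vector, then fill it one iteration at a time:
results <- rep(0, 5)
for (i in 1:5) {
  results[i] <- i^2
}
results
# [1]  1  4  9 16 25
```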

Using the for loop code above, perform model fitting/predicting on all 5 folds and then compute the overall estimated RMSLE for cp = 0:

Bonus: Modify the above code so that you hardcode the number of folds k once (k <- 5 for example) and have the rest of the code adjust accordingly.

2.5 Picking “optimal” complexity

Example: Let’s now compute the estimated RMSLE using the three values of the complexity parameter cp I played around with yesterday: 0, 0.2, 0.5. Do this by manually changing the cp value in the solutions to the above cross-validation exercise.

Question: Of these three values of cp, which one is “optimal”?

Exercise:

  1. Using a second, outer for loop, evaluate the estimated RMSLE for a large number of complexity parameter values cp, instead of just the three from above.
  2. Plot a scatterplot with cp on the x-axis and estimated RMSLE on the y-axis. Visually pick which cp value you think is optimal.
  3. Use this optimal cp to make a Kaggle submission.

2.6 Review of Multiple Linear Regression

Go over the regression.Rmd Shiny App in this RStudio Project folder. It is a refresher of multiple linear regression using the credit card debt dataset from the ISLR package, which accompanies the textbook “An Introduction to Statistical Learning with Applications in R” (available online as a free PDF).


3 Thursday, July 25 2019

3.1 LASSO regularization

Go over the LASSO.Rmd Shiny App file in this RStudio Project folder. It introduces the ideas behind the LASSO, in particular the same idea of a “complexity parameter” that balances between “model fit” and “model complexity”.

Exercise: The code below runs LASSO. Modify it to make a Kaggle submission in the House Prices: Advanced Regression Techniques competition.

library(tidyverse)
library(glmnet)
library(broom)
library(modelr)
library(ISLR)

# 1. Create training and test data
set.seed(76)

# Clean up original Credit data from ISLR package
credit <- Credit %>%
  as_tibble() %>%
  select(Balance, Income, Limit, Rating, Cards, Age, Education, Married) %>% 
  rownames_to_column(var="ID")

# Create training set
credit_train <- credit %>%
  sample_frac(0.5)

# Create test set
credit_test <- credit %>%
  anti_join(credit_train, by = "ID") %>%
  # Remove outcome variable to make a Kaggle competition-like test set
  select(-Balance)


# 2. Define model formula and create corresponding model matrices ----
model_formula <-
  "Balance ~ Income + Limit + Rating + Cards + Age + Education + Married" %>%
  as.formula()

y_train <- credit_train$Balance
x_matrix_train <- credit_train %>%
  model_matrix(model_formula, data = .) %>%
  select(-`(Intercept)`) %>%
  as.matrix()

x_matrix_test <- credit_test %>%
  # Since model_formula has the outcome variable Balance in it, model_matrix()
  # needs this variable to create the model matrix. Since credit_test doesn't
  # have it, we simply create dummy outcome variable to get around this issue:
  mutate(Balance = 1) %>%
  model_matrix(model_formula, data = .) %>%
  select(-`(Intercept)`) %>%
  as.matrix()


# 3. For given search grid of lambdas, plot LASSO trajectory. Note:
# - While this plot is quick to create, IMO it's not as informative
# - The x-axis is not lambda in log-scale, rather log(lambda)
# - The numbers correspond to columns in x_matrix_train
lambda_inputs <- 10^seq(from = -5, to = 3, length = 100)
LASSO_fit_train <- glmnet(
  x = x_matrix_train, 
  y = y_train, 
  alpha = 1, 
  lambda = lambda_inputs
)
plot(LASSO_fit_train, xvar = "lambda", label = TRUE)

## [1] -2.395619
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -499.53107808
## Income        -7.52336114
## Limit          0.07571309
## Rating         2.82104482
## Cards          9.97976427
## Age           -1.42106406
## Education      5.28899348
## MarriedYes   -19.26395180
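The printed value above (-2.395619, a log-lambda) and the 8 x 1 coefficient matrix suggest a cross-validated choice of lambda. Here is a sketch of that workflow with cv.glmnet(), refit on the full Credit data so it is self-contained (values will therefore differ from the output above):

```r
library(glmnet)
library(ISLR)

# Rebuild a response vector and predictor matrix from the full Credit data
# (same variables as above):
credit <- Credit[, c("Balance", "Income", "Limit", "Rating", "Cards",
                     "Age", "Education", "Married")]
x <- model.matrix(Balance ~ ., data = credit)[, -1]
y <- credit$Balance

# Cross-validate to pick the error-minimizing lambda:
set.seed(76)
LASSO_CV <- cv.glmnet(x = x, y = y, alpha = 1)

# The chosen lambda on the log scale:
log(LASSO_CV$lambda.min)

# Coefficients at that lambda: an 8 x 1 sparse matrix like the one above
coef(LASSO_CV, s = "lambda.min")
```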
## (vector of 200 predicted Balance values for the test set; long output omitted)

Example Kaggle submission:


4 Friday, July 26 2019

4.2 Bias-Variance Trade-off

Recall the following slide from the presentation on Tuesday:

Why does the orange curve (the model error when you fit a model to training data but evaluate its prediction performance on new independent test data) have that U-shape? Because of the “bias-variance” trade-off.

\[ \mbox{MSE}\left[\widehat{f}(x)\right] = \mbox{Var}\left[\widehat{f}(x)\right] + \left(\mbox{Bias}\left[\widehat{f}(x)\right]\right)^2 + \sigma^2 \]

where \(\widehat{y} = \widehat{f}(x)\) and \(y=f(x)+\epsilon\) with \(\mathbb{E}\left[\epsilon\right] = 0\) and \(\mbox{Var}[\epsilon] = \sigma^2\). For more info, read this blog post.
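To see the trade-off in action, here is a small simulation sketch (the true function, noise level, and polynomial fits are all illustrative choices of mine): fit polynomials of increasing degree to one training sample, evaluate on a separate test sample, and watch test MSE fall then rise.

```r
set.seed(76)

# True function f and noisy observations y = f(x) + epsilon:
f <- function(x) sin(2 * x)
n <- 100
x_train <- runif(n, 0, 3); y_train <- f(x_train) + rnorm(n, sd = 0.3)
x_test  <- runif(n, 0, 3); y_test  <- f(x_test)  + rnorm(n, sd = 0.3)

# Fit polynomials of increasing degree (increasing model complexity):
degrees <- 1:15
test_mse <- sapply(degrees, function(d) {
  fit <- lm(y_train ~ poly(x_train, d))
  y_hat <- predict(fit, newdata = data.frame(x_train = x_test))
  mean((y_test - y_hat)^2)
})

# Low degrees underfit (high bias); very high degrees tend to overfit
# (high variance), giving the U-shape in the orange curve:
plot(degrees, test_mse, type = "b",
     xlab = "Polynomial degree (complexity)", ylab = "Test MSE")
```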